feat: initial PoC Substrait consumer #1

davisusanibar · 2023-02-16T01:36:11Z

PoC Java Substrait consumer

C++ Process:

#include <iostream>
#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/c/bridge.h>
#include <arrow/engine/substrait/util.h>

// GetSubstraitJSON() is the PoC's own helper that returns the plan as JSON.
arrow::Status execute_substraitv3() {
    std::cout << "Hello Substrait!" << std::endl;
    ARROW_ASSIGN_OR_RAISE(std::string substrait_json, GetSubstraitJSON());
    ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Buffer> plan_buffer,
                          arrow::engine::SerializeJsonPlan(substrait_json));
    ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::RecordBatchReader> reader,
                          arrow::engine::ExecuteSerializedPlan(*plan_buffer));
    // Export the reader through the Arrow C stream interface.
    struct ArrowArrayStream c_stream;
    ARROW_RETURN_NOT_OK(arrow::ExportRecordBatchReader(reader, &c_stream));

    // Recover the values by importing the stream back.
    ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::RecordBatchReader> imported_reader,
                          arrow::ImportRecordBatchReader(&c_stream));
    ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> table,
                          arrow::Table::FromRecordBatchReader(imported_reader.get()));
    std::cout << "Values recovered: " << table->num_rows() << " rows and "
              << table->num_columns() << " columns" << std::endl;
    // It prints: Values recovered: 12 rows and 5 columns
    return arrow::Status::OK();
}

Java side (Java --> JNI --> C++)

Need to investigate why, on the Java side, the ArrowReader is populated with the column names but not with any data values.

  @Test
  public void testBaseSubstraitRead() throws Exception {
    try (ArrowArrayStream arrowArrayStream = ArrowArrayStream.allocateNew(rootAllocator())) {
      if (!org.apache.arrow.dataset.substrait.JniWrapper.get().executeSerializedPlan(getSubstraitPlan(), arrowArrayStream.memoryAddress())) {
        System.out.println("Nothing to show!");
      }
      try (ArrowReader arrowReader = Data.importArrayStream(rootAllocator(), arrowArrayStream)) {
        System.out.println(arrowReader.getVectorSchemaRoot().contentToTSVString());
        // It prints only the column names:
        // foo     __fragment_index        __batch_index   __last_in_fragment      __filename
        System.out.println(arrowReader.getVectorSchemaRoot().getSchema());
        // It prints: Schema<foo: Binary, __fragment_index: Int(32, true), __batch_index: Int(32, true), __last_in_fragment: Bool, __filename: Utf8>
      }
    }
  }

github-actions · 2023-02-16T01:36:32Z

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
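The three title formats above can be checked mechanically. The following is only a minimal sketch: the class name `PrTitleCheck` and the regular expression are illustrative assumptions, not the check the Arrow project actually runs. It also assumes that one or more bracketed components (for example `[C++][Java]`) are acceptable, which the template above does not state.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: checks a PR title against the formats quoted above. */
final class PrTitleCheck {
    // Prefix: GH-<digits>, MINOR, or PARQUET-<digits>; then ": ",
    // one or more [Component] groups (assumption), a space, and a non-empty summary.
    private static final Pattern TITLE = Pattern.compile(
        "(GH-\\d+|MINOR|PARQUET-\\d+): (\\[[^\\]]+\\])+ \\S.*");

    static boolean isValid(String title) {
        // matches() requires the whole title to fit the pattern.
        return TITLE.matcher(title).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("GH-12345: [Java] Add Substrait consumer"));
        System.out.println(isValid("feat: initial PoC Substrait consumer #1"));
    }
}
```

Under this sketch, the current title of this pull request is rejected because it starts with `feat:` rather than one of the three prefixes. The issue number `12345` is a made-up placeholder.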

davisusanibar · 2023-02-16T18:14:32Z

The problem was that the client never called arrowReader.loadNextBatch(), so no batch was ever loaded before the data was read.
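To illustrate the calling convention: in Arrow Java, `ArrowReader.loadNextBatch()` returns `false` at the end of the stream, and `getVectorSchemaRoot()` exposes the most recently loaded batch, so the root holds only the schema until the first load. The sketch below models that contract with a stand-in class so it runs without the Arrow dependency. `BatchReaderSketch`, `currentBatch()` and `countRows()` are hypothetical names, not part of Arrow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Stand-in for a streaming reader; NOT the Arrow API, only the same calling pattern. */
final class BatchReaderSketch {
    private final List<String> columns;
    private final Iterator<List<Integer>> pending;
    private List<Integer> current = new ArrayList<>(); // empty until the first load

    BatchReaderSketch(List<String> columns, List<List<Integer>> batches) {
        this.columns = columns;
        this.pending = batches.iterator();
    }

    /** The schema is available immediately, like the column names in the test above. */
    List<String> schema() {
        return columns;
    }

    /** Plays the role of ArrowReader.loadNextBatch(): false means end of stream. */
    boolean loadNextBatch() {
        if (!pending.hasNext()) {
            current = new ArrayList<>();
            return false;
        }
        current = pending.next();
        return true;
    }

    /** Plays the role of getVectorSchemaRoot(): rows of the most recently loaded batch. */
    List<Integer> currentBatch() {
        return current;
    }

    /** Drains the reader batch by batch and counts the rows seen. */
    static int countRows(BatchReaderSketch reader) {
        int rows = 0;
        while (reader.loadNextBatch()) {
            rows += reader.currentBatch().size();
        }
        return rows;
    }

    public static void main(String[] args) {
        BatchReaderSketch reader = new BatchReaderSketch(
            Arrays.asList("foo"),
            Arrays.asList(Arrays.asList(1, 2, 3), Arrays.asList(4, 5)));
        System.out.println(reader.schema() + " rows before load: " + reader.currentBatch().size());
        System.out.println("rows after draining: " + countRows(reader));
    }
}
```

Applied to the test above, the same pattern would mean wrapping the `contentToTSVString()` call in a `while (arrowReader.loadNextBatch())` loop.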

sonatype-lift · 2023-02-17T05:37:46Z

ci/scripts/java_jni_manylinux_build.sh

@@ -45,6 +45,7 @@ export ARROW_ORC
 : ${ARROW_PLASMA:=ON}
 export ARROW_PLASMA
 : ${ARROW_S3:=ON}
+: ${ARROW_SUBSTRAIT:=ON}


SC2223: This default assignment may cause DoS due to globbing. Quote it.


github-actions bot added the Component: Java label Feb 16, 2023

sonatype-lift bot reviewed Feb 17, 2023

davisusanibar force-pushed the poc-substrait branch from 7500dd7 to 92cb8d2 on March 13, 2023 15:42

github-actions bot added the Component: Documentation label Mar 13, 2023

feat: consume Substrait Plan

a0aac46

davisusanibar force-pushed the poc-substrait branch from 92cb8d2 to a0aac46 on March 13, 2023 16:37

davisusanibar added 10 commits March 13, 2023 12:23

fix: solving maven-dependency-plugin

0d91f09

feat: add support for execution of Substrait binary plans also

0599dc2

Upgrade to Java 11 to be able to consume Isthmus library

c794ae5

fix: profile to Java test with JDK11 (be able to consume Isthmus library)

8cc5443

fix: solve error to call Isthmus by Dataset that use JDK8

e5594f8

fix: detected both log4j-over-slf4j.jar AND bound slf4j-reload4j.jar on the class path

223ddef

fix: rollback changes on orc

795e619

Merge branch 'main' into poc-substrait

3bd18f1

fix: able to compile main source with jdk8 and test with jdk11

088a101

fix: able to compile main source with jdk8 and test with jdk11

ba23e44

github-actions bot added the awaiting review label Mar 16, 2023

davisusanibar and others added 8 commits March 16, 2023 06:09

fix: JAVA_HOME_11_X64: command not found

8655815

fix: partial comments fix

d22d6b1

Update java/dataset/src/main/cpp/jni_util.h

f0d8a25

Co-authored-by: David Li

Update java/dataset/src/main/java/org/apache/arrow/dataset/substrait/…

632f90d

…SubstraitConsumer.java

Co-authored-by: David Li

fix: comments

9437f4e

fix: comments

61d6ee7

fix: comments

64c7607

fix: hash boost_1_81_0 does not match expected value

721fe01

github-actions bot added the Component: C++ label Mar 22, 2023

davisusanibar added 2 commits March 22, 2023 09:27

fix: maven-shade-plugin:jar:3.1.1 -> org.ow2.asm:asm:jar:6.0: Failed to read artifact

b3c2e1e

Merge branch 'main' into poc-substrait

f5596c9

davisusanibar added 2 commits March 28, 2023 13:43

fix: clean sout

8c57c16

fix: rollback maven-shade-plugin

766b383

github-actions bot removed the Component: C++ label Mar 28, 2023

davisusanibar and others added 27 commits March 28, 2023 22:14

fix: failures test

5e8b887

fix: delete methods not needed, create files of substrait plan

7f59fbd

fix: npe read resources

0d2bcf8

fix: add resources files for nosuchfile error

4380932

fix: add resources files for nosuchfile error

9bbe4fb

fix: update rst documentation

5351ee1

Apply suggestions from code review

e966d32

Co-authored-by: David Li

fix: code review

cfe4061

Merge branch 'main' into poc-substrait

2419896

Merge branch 'main' into poc-substrait

8811bc6

fix: rebase and changes to consider new arrow acero

ead4784

fix: solving PR comments

9bfa15c

Merge branch 'main' into poc-substrait

8a0eae6

fix: solving PR comments

87e75eb

Merge branch 'main' into poc-substrait

812921f

fix: rebase

89060eb

Update java/dataset/src/main/java/org/apache/arrow/dataset/substrait/…

33c634f

…AceroSubstraitConsumer.java

Co-authored-by: David Li

fix: comment on code review

34979a5

fix: comment on code review

1a6f0e5

fix: validate input on arrow Table associated with a given table name

e388be5

fix: code review

8eb3e40

Merge branch 'main' into poc-substrait

ce7800b

Merge branch 'main' into poc-substrait

fdd042b

fix: solve code review comments

3dddea0

fix: solve code review comments

6bdae18

fix: solve code review comments

2d9fc84

fix: solve code review comments

9b5f0cb
